Add CommentStrategy to optionally preserve leading comments#53
Conversation
By default comments preceding a query are stripped. The new opt-in constructor flag `preserveLeadingComments` keeps them as a prefix of the yielded query string instead, which is useful when comments carry meaningful annotations. The leading "skip" zone of the query regex is split into a part that strips pure leading whitespace and a captured `leadingComments` group that collects --, # and /* */ comments (with their original formatting) directly preceding a query. A comment between two queries is treated as preceding the following one; comments not followed by any query are dropped. Default behavior is unchanged. Co-Authored-By: Claude Code
Move the `preserveLeadingComments` constructor flag and the comment prepending logic into BaseMultiQueryParser, and apply the same regex restructure (strip pure leading whitespace, then capture preceding comments into a `leadingComments` group) to the PostgreSQL, SQL Server and SQLite parsers. The shared, dialect-agnostic behavior is now tested once in MultiQueryParserTestCase (so every parser runs it), including a two-chunk-boundary streaming test; MySQL keeps its # hash-comment specific cases. Default behavior is unchanged for all parsers. Co-Authored-By: Claude Code
It is read only inside BaseMultiQueryParser::buildQuery(); no subclass accesses it, so `protected` advertised an extension point that does not exist. Tighten the boundary to `private`. Co-Authored-By: Claude Code
PatternIterator declared its yielded match as array<mixed>, but preg_match results (no PREG_OFFSET_CAPTURE / PREG_UNMATCHED_AS_NULL) are always string-valued. Declaring array<array-key, string> reflects that guarantee and lets buildQuery() return/concatenate the match values directly, removing the (string) casts. Co-Authored-By: Claude Code
The behavior is documented in the readme; the typed, self-describing parameter does not need a duplicate prose comment. Co-Authored-By: Claude Code
There was a problem hiding this comment.
Pull request overview
Adds an opt-in preserveLeadingComments parser option so leading SQL comments can be retained as part of yielded query strings while preserving existing default behavior.
Changes:
- Adds shared constructor state and
buildQuery()logic inBaseMultiQueryParser. - Updates all dialect parsers to capture leading comments and prepend them when enabled.
- Adds shared and MySQL-specific tests plus README documentation for the new option.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
src/BaseMultiQueryParser.php |
Adds the shared option and query-building helper. |
src/MySqlMultiQueryParser.php |
Captures leading MySQL comments, including #, and uses shared query building. |
src/PostgreSqlMultiQueryParser.php |
Captures leading PostgreSQL comments and uses shared query building. |
src/SqlServerMultiQueryParser.php |
Captures leading SQL Server comments and uses shared query building. |
src/SqliteMultiQueryParser.php |
Captures leading SQLite skip/comment regions and uses shared query building. |
src/PatternIterator.php |
Tightens iterator match type documentation. |
tests/inc/MultiQueryParserTestCase.php |
Adds shared preservation behavior and chunk-boundary tests. |
tests/cases/MySqlMultiQueryParserTest.phpt |
Adds MySQL hash-comment preservation coverage and constructor option support. |
tests/cases/PostgreSqlMultiQueryParserTest.phpt |
Passes the new option through test parser creation. |
tests/cases/SqlServerMultiQueryParserTest.phpt |
Passes the new option through test parser creation. |
tests/cases/SqliteMultiQueryParserTest.phpt |
Passes the new option through test parser creation. |
readme.md |
Documents preserveLeadingComments usage and behavior. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
Another option would be to consider the previous behavior where comments were stripped a bug, and and make this simple into bug fix without the extra option. |
|
The PR description is missing the most important part -> the motivation. What's the use case? Why is 'preserveLeadingComments' needed? Why don't trailing comments need to be preserved? Or maybe comments in between? (I guess, a comment before a comment won't get parsed). |
|
The SQL that is running is visible in observability tools. It's a common pattern to prepend a comment the query to make it visible in the observability tools (prepend vs append is important, because the full query is often not visible). -- some explanation for why we need this alter
-- or why we think it's safe to run at all
ALTER something_complicated_and_risky;The obvious problem is that given sequence SELECT 1; -- hello
SELECT 2; -- worldvs -- hello
SELECT 1;
-- world
SELECT 2;This PR assumes that the second case is far more common and that it's not worth trying to fine tune the heuristics to support the first case as well. And as consequence the trailing comments are not supported, since there is no query to attach them to.
Comment in between is a comment before query. Multiple comments before a query are supported. |
|
If we produce Isn't that a bit more universal? We may bundle such an iterator as well 🤔 |
|
That would be a major BC break. And you can't actually process this on user-side any better than this PR does it, unless we provide additional metadata such as line number and offset. That should all be possible but it would be much larger in scope and require a new major version. |
Well, first we assess the arch & design, then the means to achieve BC, right? E.g.,
Could you give an example when this is required to parse it correctly then? |
|
The problematic part on the user side is the detection if something is a comment, isn't it? |
If you want to resolve to which query a comment belong, you need the positional information to determine if the comment started on the same line as the previous query. Or if it started on a new line.
Yes. That's why I assume it is BC break, because you need to wrap the strings with |
|
Possible BC solution, but not pretty /**
* @param positive-int $chunkSize
* @return Iterator<($includeComments is true ? Query|Comment : string)>
* @throws RuntimeException
*/
public function parseFile(
string $path,
int $chunkSize = self::DEFAULT_CHUNK_SIZE,
bool $includeComments = false,
): Iterator;or /**
* @param positive-int $chunkSize
* @return Iterator<Query|Comment>
* @throws RuntimeException
*/
public function parseFileWithComments(
string $path,
int $chunkSize = self::DEFAULT_CHUNK_SIZE,
): Iterator;I prefer the solution in this PR to both of these. |
|
It seems you are focused on the API and BC, and not on the merit of what functionality we want to offer. So let me switch to this approach as well :) My Kotlin-like mind would offer this. https://phpstan.org/r/70b27277-20ff-446e-9790-22dbccc77392 /**
* @template O
*/
interface Mapper {
/**
* @param Iterator<int, Query|Comment> $iterator
* @return Iterator<int, O>
*/
function map(Iterator $iterator): Iterator;
}
/**
* @template O
* @implements IMultiQueryParser<O>
*/
class BaseMultiQueryParser implements IMultiQueryParser {
/**
* @param Mapper<O> $mapper
*/
function __construct(
private Mapper $mapper = new QueryOnlyMapper(),
) {}
/**
* @param Iterator<string> $stream
* @return Iterator<O>
*/
abstract public function parseStringStream(Iterator $stream): Iterator;
}EDIT: PHPStorm type inference sucks, tho. |
|
Yes, basically, I'm trying to come up with a solution that won't hardcode the very custom non-standardized behavior attaching comments to the following queries. Obviously, it makes very much sense for your company/use case. Yet, somebody may come and tell that the separator cannot be two empty lines, because then the meaning of the comment is not specific to the next query. Those POV are similarly valid as many another approchaes. That makes me feel I would abstract it differently. |
|
If we provide custom inner callback / interface, then we don't need to change the final type from Implemented as |
Parsers now tokenize input into a neutral Query|Comment fragment stream; a pluggable CommentStrategy collapses it into the final query strings. The default DropComments reproduces the historical comment-stripping output, while KeepLeadingComments prepends a query's leading comments. This keeps the opinionated comment-handling policy out of the parser core and leaves the public return type as Iterator<string>. Co-Authored-By: Claude Code
The name now states the mechanism (the comment is prepended to the following query, producing one combined string) and encodes the prepend-vs-append distinction that motivates the feature. Co-Authored-By: Claude Code
Co-Authored-By: Claude Code
Co-Authored-By: Claude Code
Co-Authored-By: Claude Code
hrach
left a comment
There was a problem hiding this comment.
👍 oh yes, this is nicely reasonable. not exposing the fragment type publicly, but simple enough to provide solution.
Summary
Adds the ability to retain SQL comments that precede a query, via a pluggable
CommentStrategyinjected into the parser constructor. By default comments are still stripped, so existing behavior — and the publicIterator<string>return type — are unchanged.Motivation
The SQL that runs is visible in observability tools, and it's a common pattern to prepend a comment to a query so it shows up there. Prepending (not appending) matters because the full query is often truncated in those tools:
-- why this risky alter is safe to run ALTER something_complicated;Previously the parser unconditionally stripped such comments.
Design
The parser now tokenizes input into a neutral stream of
QueryandCommentfragments; aCommentStrategycollapses that stream into the final query strings:Two strategies are bundled:
DropComments(default) — drops every comment; reproduces the historical output byte-for-byte.PrependLeadingComments— prepends the comments preceding a query as a prefix of that query.A user wanting a different policy (blank-line separators, trailing-comment handling, …) implements
CommentStrategythemselves; the non-standardized "a comment belongs to the following query" assumption lives only inPrependLeadingComments, not in the parser.Behavior (
PrependLeadingComments)--,/* */, and#for MySQL) directly preceding a query are preserved with their original formatting; only pure leading whitespace is stripped.DELIMITERdirective) is faithfully emitted as aCommentfragment, but dropped by both bundled strategies since there is no query to attach it to.BC
Iterator<string>;IMultiQueryParseris unchanged.DropComments⇒ no behavior change.readonly/newin default args).Tests
#- andDELIMITER-specific cases.Commentfragments even when no query follows (trailing comment, comment-before-DELIMITER).Co-Authored-By: Claude Code